arXiv:1506.06716v1 [nlin.CG] 22 Jun 2015


June 23, 2015 0:50 


International Journal of Modern Physics C 
(c) World Scientific Publishing Company 


A MODEL OF LANGUAGE INFLECTION GRAPHS 


HENRYK FUKS 
BABAK FARZAD 
YI CAO 

Department of Mathematics and Statistics 
Brock University, St. Catharines 
Ontario, Canada L2S 3A1

hfuks@brocku.ca, bfarzad@brocku.ca, caoyil63@gmail.com 

Received 3 July 2013 
Revised 27 November 2013 
Accepted 29 November 2013 


Inflection graphs are highly complex networks representing relationships between
inflectional forms of words in human languages. For so-called synthetic languages,
such as Latin or Polish, they have particularly interesting structure due to the
abundance of inflectional forms. We construct the simplest form of inflection
graph, namely a bipartite graph in which one group of vertices corresponds to
dictionary headwords and the other group to inflected forms encountered in a given
text. We then study the projection of this graph on the set of headwords. The
projection decomposes into a large number of connected components, to be called
word groups. The distribution of sizes of word groups exhibits some remarkable
properties, resembling the cluster distribution in lattice percolation near the
critical point. We propose a simple model which produces graphs of this type,
reproducing the desired component distribution and other topological features.


1. Introduction

Human languages can be studied from many different perspectives. If we think of
a foreign language, however, we typically think of words of that language, thus
it is quite natural that vocabulary is one of the most extensively studied features
of languages. In recent years, the network paradigm has been used to study
vocabularies, and within this paradigm, words of the language are viewed as
vertices of a large and complex network or graph, with edges representing
relationships between words. Many such models emphasizing different relationships
between words have been studied in the past decade, including networks of
co-occurrences of words in sentences, thesaurus graphs, WordNet database graphs,
and many others.

It is fair to say that many of the aforementioned works concentrated on the
English language, which has the very characteristic property of being analytic, that
is, exhibiting only minimal inflection. In analytic languages grammatical relations
and categories are handled mostly by word order, and not by inflection, thus
making them somewhat easier to learn.

In contrast to this, synthetic languages such as Latin, Greek, Polish, or Russian
make extensive use of inflection, and one word in these languages can appear in
a great many forms, reflecting grammatical categories such as tense, mood, person,
number, gender, case, etc. Order of words is less important in synthetic languages.
While this is an excellent feature from the point of view of a poet, it presents
algorithmic problems in text processing. Let us suppose, for example, that we want
to count the number of distinct words in a given work - e.g., for the purpose of
comparing two works and deciding which one uses a larger vocabulary. How do we
do this in a language like Latin, where one dictionary headword can have as many
as a hundred different forms? To make things even more difficult, in some cases one
inflectional form can correspond to more than one dictionary headword, and one
must deduce from the context which one to choose.

In Ref. 11 one of the authors considered this problem from a practical point of
view, and proposed a solution which exploits some features of the so-called inflection
graph. Here we will not dwell on this problem, referring the interested reader to
Ref. 11, but we will instead discuss the inflection graph itself. We will first describe
some of its topological features, and then propose a model which reproduces these
features.


2. Inflection graphs 

The inflection graph for a given language can be constructed as follows. First we
need to create a list of all words of the language, which, strictly speaking, is an
impossible task, as every such list is bound to be incomplete. Nevertheless, one can
easily obtain a reasonably adequate list of words using a sufficiently large dictionary
of the language. The set of all dictionary headwords will be denoted by $H$. For
each headword, we generate a list of all possible inflected forms, and the set of all
inflected forms obtained this way will be denoted by $I$. We then construct
a bipartite graph $G = (H, I, E)$, where $E$ is the set of edges such that the edge
between $v \in H$ and $u \in I$ exists if and only if $u$ is an inflected form of $v$.

Construction of the inflection graph is obviously possible only if one is able
to produce all inflected forms of a given word. For the Latin language, this can
be achieved using WORDS, a computerized dictionary of Latin created by William
Whitaker. The resulting bipartite graph has 1 028 972 vertices and 1 077 806 edges,
and will be denoted by $G_{LA}$.

We were also able to construct the inflection graph for the Polish language, using
the lexical grammar developed by the Group of Computer Linguistics of AGH University
of Science and Technology in Krakow. The corresponding graph, to be denoted by
$G_{PL}$, has 1 872 140 vertices and 802 911 edges.

Normally, for most headwords in $H$, there are many corresponding inflected
forms in $I$, so an element of $H$ is typically connected to many (sometimes 100
or more) elements of $I$. For example, the Latin words dicunt (they say) and dixit
(he said) are both inflected forms of the verb dico, thus we will have a vertex in
$H$ corresponding to dico connected to vertices in $I$ corresponding to dicunt and
dixit. However, the opposite can also be true: in some instances, a word can be
an inflected form of more than one headword, so that vertices of $I$ are sometimes
connected to more than one vertex of $H$. As an example, consider the word sublatus,
which could be a form of tollo (lift, raise) or suffero (bear, endure), thus a vertex
in $I$ corresponding to sublatus will be connected to two vertices in $H$.
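
The construction just described can be illustrated with a minimal Python sketch.
The toy dictionary below is a hypothetical stand-in for a real source such as
WORDS: its form lists are heavily abbreviated, and the function names are chosen
for illustration only.

```python
from collections import defaultdict

# Toy stand-in for a dictionary: headword -> some of its inflected forms.
# The lists are abbreviated and purely illustrative.
TOY_DICTIONARY = {
    "dico": ["dico", "dicunt", "dixit"],
    "tollo": ["tollo", "tollit", "sublatus"],
    "suffero": ["suffero", "suffert", "sublatus"],
}


def build_inflection_graph(dictionary):
    """Return (H, I, E) for the bipartite graph G = (H, I, E)."""
    headwords = set(dictionary)
    forms = set()
    edges = set()
    for headword, inflected in dictionary.items():
        for form in inflected:
            forms.add(form)
            edges.add((headword, form))  # edge between v in H and u in I
    return headwords, forms, edges


def headwords_of(edges):
    """Map every inflected form to the set of headwords it is attached to."""
    owners = defaultdict(set)
    for headword, form in edges:
        owners[form].add(headword)
    return owners


H, I, E = build_inflection_graph(TOY_DICTIONARY)
owners = headwords_of(E)
# Forms connected to more than one headword, such as sublatus.
ambiguous = {form for form, hs in owners.items() if len(hs) > 1}
```

In this toy graph the only vertex of $I$ with more than one neighbour in $H$ is
sublatus, mirroring the example in the text.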

The inflection graphs are rather sparse, and they decompose into a large number
of connected components of different sizes. From the practical point of view, the
size of a component is not as important as the number of distinct headwords in
it. The set of headwords belonging to one component will be called a headword
group. The motivation for this can be explained as follows.

Suppose that one wants to perform a computerized count of the number of
different words occurring in a given text. Obviously, one wants to count two different
inflected forms of a given word as one and the same word, or, to put this differently,
one wants to know how many distinct dictionary headwords appear in the text
(in various inflected forms). However, since in languages with a complex inflection
system a given inflected form can sometimes belong to two (or more) different
dictionary headwords, it is impossible for a computer to decide which one is used in
a particular case. To make such a decision, one has to understand the sentence and
figure out from the context what it means. In English this problem is quite rare,
but it still exists. For example, consider the word dove - this could be the singular
form of the noun dove (a type of bird), or the simple past tense of the verb to dive.
A computer program, upon encountering dove in a text, will not know whether to count
it as an occurrence of the headword dove or to dive. The simple solution to this problem
is to say that dove is an inflected form of a headword from the set (headword group)
{dove, to dive}. This means that instead of counting how many distinct headwords
are present in the text, we can only count how many distinct headword groups are
present. We want to know, however, what the sizes of the headword groups are, as
this is, in a sense, a measure of the difficulty of the disambiguation problem. A good
way to analyze these sizes is to look at their distribution.
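
As a rough illustration of how headword groups and their size distribution could
be computed, the following Python sketch merges headwords that share an inflected
form using a union-find structure. This is one standard way of finding connected
components, not necessarily the method used to produce the results reported here;
the toy dictionary and all names are hypothetical.

```python
from collections import Counter


def headword_groups(dictionary):
    """Partition headwords into groups linked by shared inflected forms."""
    parent = {h: h for h in dictionary}

    def find(x):
        # Follow parent links to the representative, halving paths on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_owner = {}  # inflected form -> first headword seen with that form
    for headword, forms in dictionary.items():
        for form in forms:
            if form in first_owner:
                # Shared form: merge the two headwords' groups.
                parent[find(headword)] = find(first_owner[form])
            else:
                first_owner[form] = headword

    groups = {}
    for h in dictionary:
        groups.setdefault(find(h), set()).add(h)
    return [frozenset(g) for g in groups.values()]


# Illustrative English toy data built around the dove / dive example.
TOY = {
    "dove": ["dove", "doves"],
    "dive": ["dive", "dives", "dived", "dove", "diving"],
    "cat": ["cat", "cats"],
}
groups = headword_groups(TOY)
# Number of headword groups of each size.
size_distribution = Counter(len(g) for g in groups)
```

Here the form dove ties the headwords dove and dive into one group of size 2,
while cat forms a group of size 1.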

The distribution of headword group sizes in inflection graphs is quite striking,
as can be seen in Figure 1, which shows the distributions for Latin and Polish. The
graph for Latin and its analysis have been previously published in Ref. 11; here we
add the same graph for the Polish language.

We fitted a straight line in log-log coordinates to data points for which the
number of groups exceeds 20, in order to exclude points with small counts. The lines
of best fit are shown as dashed lines. There seems to be a power-law trend in
both data sets, more strongly pronounced in the graph for the Latin language. In the
remaining part of this paper we will attempt to shed some light on the origin of this
phenomenon.

Fig. 1. Distribution of headword clusters for Latin (left) and Polish (right). Slope of the fitted
line is, correspondingly, $-3.1 \pm 0.3$ and $-4.3 \pm 0.9$. The figure for Latin previously appeared in
Ref. 11.
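
The fitting step can be sketched in Python as follows. This is an unweighted
least-squares fit written out by hand and run on synthetic data; it is an
assumption made for illustration, since the exact fitting procedure (for example,
any weighting of the points) is not described here. The function name and the
`min_count` parameter are hypothetical.

```python
import math


def fit_power_law(sizes, counts, min_count=20):
    """Least-squares slope and intercept of log10(count) vs. log10(size).

    Only points whose count exceeds min_count enter the fit.
    """
    pts = [(math.log10(s), math.log10(c))
           for s, c in zip(sizes, counts) if c > min_count]
    if len(pts) < 2:
        raise ValueError("need at least two points above the count cutoff")
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic data obeying count = 10000 * size**-3 exactly (not the paper's data).
sizes = [1, 2, 3, 4, 5, 6, 7, 8]
counts = [10000 * s ** -3 for s in sizes]
slope, intercept = fit_power_law(sizes, counts)
```

For the synthetic data the last point (size 8, count below 20) is excluded by the
cutoff, and the fitted slope recovers the exponent used to generate the data.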

The dashed lines of best fit shown in Figure 1 represent the power law

$$ n_s \sim s^{-\tau}, \qquad (1) $$

where $n_s$ is the number of headword groups of size $s$, and $\tau \approx 3.1 \pm 0.3$
for Latin and $\tau \approx 4.3 \pm 0.9$ for Polish (the fitted slopes being $-\tau$).
The errors quoted for $\tau$ signify that decreasing or increasing $\tau$ by the given
amount doubles the reduced $\chi^2$ of the fit. Anyone familiar with percolation theory
will immediately recall that a very similar scaling law for cluster sizes holds for
lattice percolation at the critical point, where $\tau$ is known as the Fisher exponent.
This is also the case for the Erdős-Rényi model $G(n,p)$, that is, a graph constructed
by connecting $n$ nodes randomly so that each edge is included in the graph with
probability $p$, independently of every other edge. It is well known that at $np = 1$
and $n \to \infty$ the model undergoes a structural transition similar to percolation.
The distribution of component sizes follows the power law of eq. (1), and the Fisher
exponent is known to be $\tau = 2.5$. Figure 2 shows component size distributions
obtained numerically for $G(n,p)$ with $n = 28092$, that is, the same $n$ as the number
of headwords in $G_{LA}$. Three values of $np$ were used: $np = 0.5$ (below the
percolation threshold), $np = 2.0$ (above the percolation threshold) and $np = 1.0$
(at the percolation threshold). The power law in the form of eq. (1) is evident at
the percolation threshold, yet it is clearly not valid away from the threshold. In
spite of the fact that the number of vertices is relatively small and that only 10
graphs were generated, the value of the exponent $\tau = 2.44 \pm 0.09$ obtained by
fitting a straight line to the data agrees, within error bounds, with the
aforementioned value of $\tau = 2.5$.
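
A small-scale version of this numerical experiment might look like the Python
sketch below. It samples $G(n,p)$ directly by visiting every pair of vertices,
which is only practical for small $n$; the value $n = 400$ and the seed are
arbitrary choices made to keep the example fast, not the $n = 28092$ used in the
text, and the function name is hypothetical.

```python
import random
from collections import Counter


def gnp_component_sizes(n, p, rng):
    """Return {component size: number of components} for one G(n, p) sample."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:  # each edge is included independently
                parent[find(u)] = find(v)

    members = Counter(find(x) for x in range(n))  # root -> component size
    return Counter(members.values())              # size -> number of components


rng = random.Random(12345)
n = 400
below = gnp_component_sizes(n, 0.5 / n, rng)     # np = 0.5
critical = gnp_component_sizes(n, 1.0 / n, rng)  # np = 1.0
above = gnp_component_sizes(n, 2.0 / n, rng)     # np = 2.0
```

Averaging such counts over several realizations and fitting them in log-log
coordinates would give a rough analogue of the comparison described above,
although at this small $n$ the statistics are far noisier.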

Considering the case of $G(n,p)$, one could suspect that the inflection graphs have
a structure somewhat resembling Erdős-Rényi random graphs at the percolation
threshold. We will, however, demonstrate that the situation is somewhat more
complicated. To avoid repetition, from now on we will be using $G_{LA}$ as an example.

Fig. 2. Distribution of component sizes for the random graph averaged over 10 realizations of the
graph above (x), below (★), and at the percolation threshold (+). The data point corresponding
to the giant component above the percolation threshold is not shown. Slope of the fitted line is
$-2.44 \pm 0.09$. Error bars correspond to standard deviations, and for clarity are shown only for the
data at the critical point.


3. Structure of the inflection graph for Latin 

In order to describe some important features of $G_{LA}$, we will consider its projection
on $H$. Given a bipartite graph $G = (H, I, E)$, define its $H$-projection as $G' =
(H, E')$, where $\{u, v\}$ is in $E'$ if and only if $u$ and $v$ are both connected to a
common vertex in $I$. The $H$-projection of $G_{LA}$, denoted by $G'_{LA}$, has 28092
vertices and 24064 edges. Only 13345 headwords have degree greater than zero in
$G'_{LA}$. Note that, for obvious reasons, the distribution of component sizes of
$G'_{LA}$ is the same as the distribution of group sizes in $G_{LA}$. Could it then be
that $G'_{LA}$ resembles an Erdős-Rényi random graph?
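
The definition of the $H$-projection translates almost directly into code. The
Python sketch below is a minimal illustration on a hypothetical toy edge list; the
function name and the representation of edges as (headword, form) pairs are
assumptions made for this example.

```python
from itertools import combinations


def h_projection(edges):
    """Project bipartite edges (headword, form) onto the set of headwords."""
    by_form = {}
    headwords = set()
    for headword, form in edges:
        headwords.add(headword)
        by_form.setdefault(form, set()).add(headword)
    projected = set()
    for owners in by_form.values():
        # Every pair of headwords sharing this form becomes an edge of E'.
        for u, v in combinations(sorted(owners), 2):
            projected.add((u, v))
    return headwords, projected


# Illustrative toy edges; only sublatus is shared between two headwords.
EDGES = {
    ("dico", "dicunt"), ("dico", "dixit"),
    ("tollo", "tollit"), ("tollo", "sublatus"),
    ("suffero", "sublatus"),
}
headwords, projected = h_projection(EDGES)

degree = {h: 0 for h in headwords}
for u, v in projected:
    degree[u] += 1
    degree[v] += 1
nonzero = sum(1 for d in degree.values() if d > 0)
```

In the toy example the projection has three vertices and a single edge, so two of
the three headwords have nonzero degree.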

In order to answer this question, we will first consider the degree distribution of
$G'_{LA}$, shown in Figure 3. Unlike in the case of $G(n,p)$, the degree distribution of
$G'_{LA}$ is clearly not Poissonian, and for small degree values it seems to follow
exponential decay, shown in the figure as a straight line. The mean vertex degree is
1.8. This already indicates that $G(n,p)$ cannot be a model of $G'_{LA}$ - the mean
vertex degree of $G(n,p)$ with a power-law distribution of component sizes must be
equal to 1.0.

We can see the difference between $G(n,p)$ and $G'_{LA}$ even better if we use the
notion of core clustering spectrum, introduced in Ref. 18. For a non-negative
integer $k$, the $k$-core of a graph is the maximal subgraph such that its vertices have
degree greater than or equal to $k$. By the "degree" in this definition we mean the
degree of the vertex in the subgraph. If $G$ is a given graph, we denote by $G_{\{k\}}$
the $k$-core of $G$. Now let $C(G)$ denote the clustering coefficient of $G$. The set of
pairs $(|G_{\{k\}}|, C(G_{\{k\}}))$, where $|G|$ denotes the number of vertices of $G$,
will be called the core clustering spectrum of $G$. One can visualize the core clustering
spectrum by plotting the points $(|G_{\{k\}}|, C(G_{\{k\}}))$ on a plane, as has been done
in Ref. 18. Here we will use slightly different graphs in order to convey similar
information, namely we will plot $C(G_{\{k\}})$ as a function of $k$. We will call it the
graph of clustering coefficients of cores. This has the advantage over the plot of the
core spectrum of having the core number explicitly as one of the variables. The value
of $k$ will range from 1 to $k_{max}$, where $k_{max}$ is the largest $k$ for which
$G_{\{k\}}$ is non-empty.

Fig. 3. Degree distribution (left) and clustering coefficients of cores (right) of the $H$-projection of
$G_{LA}$ (top) and $G_{PL}$ (bottom).
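
The graph of clustering coefficients of cores can be computed along the lines of
the following Python sketch. It assumes that $C(G)$ is the average of the local
clustering coefficients over all vertices, with vertices of degree below 2
contributing zero; the precise definition used in Ref. 18 may differ. The function
names and the toy graph are hypothetical.

```python
def k_core(adj, k):
    """Return the adjacency dict of the k-core (all degrees >= k)."""
    core = {v: set(nb) for v, nb in adj.items()}
    queue = [v for v, nb in core.items() if len(nb) < k]
    while queue:
        v = queue.pop()
        if v not in core:
            continue  # already removed
        for u in core.pop(v):
            core[u].discard(v)
            if len(core[u]) < k:
                queue.append(u)
    return core


def average_clustering(adj):
    """Average local clustering coefficient (assumed definition of C(G))."""
    if not adj:
        return 0.0
    total = 0.0
    for nb in adj.values():
        d = len(nb)
        if d < 2:
            continue
        links = sum(1 for u in nb for w in nb if u < w and w in adj[u])
        total += 2.0 * links / (d * (d - 1))
    return total / len(adj)


def clustering_of_cores(adj):
    """List of (k, |G_{k}|, C(G_{k})) for k = 1, ..., k_max."""
    result = []
    k = 1
    while True:
        core = k_core(adj, k)
        if not core:
            break
        result.append((k, len(core), average_clustering(core)))
        k += 1
    return result


# Toy graph: a 4-clique {0,1,2,3}, a tail 3-4-5, and an isolated vertex 6.
ADJ = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2, 4},
       4: {3, 5}, 5: {4}, 6: set()}
spectrum = clustering_of_cores(ADJ)
```

For the toy graph the 2-core and 3-core are both the 4-clique, so the clustering
coefficient rises to 1 in the inner cores and $k_{max} = 3$.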

For some graphs, such as the Erdős-Rényi random graphs, most vertices belong
to the same $k$-core, as documented in Ref. 18. This means that the graph of
clustering coefficients of cores for Erdős-Rényi random graphs is very narrow,
consisting of only a small number of points. This is not the case for $G'_{LA}$, as
Figure 3 attests. $G'_{LA}$ possesses a highly clustered inner core, a feature absent in
the Erdős-Rényi model near the percolation threshold.

The degree distribution of $G'_{PL}$ and its graph of clustering coefficients of cores
are quite similar to the corresponding graphs of $G'_{LA}$, as shown in the bottom of
Figure 3.


4. Model 

In order to construct a model of inflection graphs which exhibits power-law scaling
resembling Figure 1, as well as having the degree distribution and clustering
coefficients of cores of its $H$-projection resembling Figure 3, we need to make a
couple of further remarks regarding the topological structure of inflection graphs,
again using $G_{LA}$ as an example. It is useful to think of $G_{LA}$ as a collection of
stars, each centered at a headword and with arms connecting the headword to some
inflected forms. These stars are not completely disjoint, however. Sometimes they
share one or more vertices in $I$, and this occurs if a given headword shares some of
its inflected forms with another headword (or headwords).

Let $n$ be the number of headwords, and $m$ be the number of inflected forms.
Construction of the random graph serving as a model of $G_{LA}$ proceeds in two stages.
In stage 1, we generate an assembly of stars, each centered at a headword and with
arms connecting the headword to some inflected forms. In stage 2, we generate a
number of random bridges between these stars. We now describe the two stages in
detail.


Algorithm for generating stars

(1) Generate the set of vertices $H = \{H_1, H_2, \ldots, H_n\}$ corresponding to headwords,
and another set $I = \{I_1, I_2, \ldots, I_m\}$ corresponding to inflected forms.

(2) For each $i \in \{1, 2, \ldots, n\}$, draw a random number $X_i$ from a distribution
$f_h$ to be described below, and connect vertex $H_i$ to vertices
$I_{j+1}, I_{j+2}, \ldots, I_{j+\lfloor |X_i| \rfloor}$, where $j = 0$ for $i = 1$ and
$j = \sum_{p=1}^{i-1} \lfloor |X_p| \rfloor$ otherwise. If any vertex index in
$I_{j+1}, I_{j+2}, \ldots, I_{j+\lfloor |X_i| \rfloor}$ exceeds $m$, it is replaced by its
value modulo $m$.

(3) If any isolated vertices in $I$ still remain, connect each of them to a randomly
selected vertex in $H$. After this is done, relabel the set $I$ so that vertices connected
to the same headword are labeled with a block of consecutive integers.


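The probability distribution function $f_h$ is a weighted sum of three normal
distributions,

$$ f_h(x) = \sum_{i=1}^{3} w_i f_{\sigma_i,\mu_i}(x), \qquad (2) $$

where

$$ f_{\sigma,\mu}(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right). \qquad (3) $$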

We used values $(w_1, w_2, w_3) = (0.68, 0.28, 0.04)$, $(\mu_1, \mu_2, \mu_3) = (8, 90, 3)$ and
$(\sigma_1, \sigma_2, \sigma_3) = (2, 10, 1)$. These were obtained by fitting the resulting
degree distribution to the degree distribution of the actual inflection graph, but their
values are not too critical, meaning that small changes in the values of these parameters
still produce graphs with a power-law distribution of headword group sizes.

Note that the random number $X_i$ drawn from the distribution $f_h$ in step 2 may
theoretically round to zero, although the probability of such an event is extremely
small. In our program implementing the algorithm for generating stars, we simply
reject such an outcome and draw another number if it happens.
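
The star-generating stage, including the rejection of zero outcomes, could be
sketched in Python as follows. This is an illustrative reading of the three steps
under stated assumptions, not the program used to obtain the results: indices run
from 0 to $m - 1$ rather than from 1 to $m$, the mixture is sampled by first
picking a component with probability $w_i$, and all function names are
hypothetical.

```python
import math
import random

WEIGHTS = (0.68, 0.28, 0.04)
MEANS = (8.0, 90.0, 3.0)
SIGMAS = (2.0, 10.0, 1.0)


def draw_star_size(rng):
    """Draw floor(|X|) with X distributed according to f_h, rejecting zero."""
    components = list(zip(MEANS, SIGMAS))
    while True:
        mu, sigma = rng.choices(components, weights=WEIGHTS)[0]
        size = math.floor(abs(rng.gauss(mu, sigma)))
        if size > 0:
            return size


def generate_stars(n, m, rng):
    """Return a list whose i-th entry lists the forms (0..m-1) joined to H_i."""
    stars = []
    used = set()
    offset = 0  # plays the role of j: number of form slots consumed so far
    for _ in range(n):
        size = draw_star_size(rng)
        arms = {(offset + t) % m for t in range(size)}  # wrap indices past m
        stars.append(arms)
        used |= arms
        offset += size
    # Step 3: attach each still-isolated form to a randomly chosen headword.
    for form in range(m):
        if form not in used:
            stars[rng.randrange(n)].add(form)
    # Relabel forms so that each headword's forms get consecutive labels.
    new_label = {}
    for arms in stars:
        for form in sorted(arms):
            new_label.setdefault(form, len(new_label))
    return [sorted(new_label[f] for f in arms) for arms in stars]


rng = random.Random(7)
stars = generate_stars(50, 2000, rng)  # small illustrative sizes
```

With the small sizes chosen here the result is a list of 50 stars that together
cover all 2000 form vertices; the values of `n` and `m` in the text are much
larger.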

The reason for taking $f_h$ to be the sum of three normal distributions is the
structure of Latin vocabulary. With respect to inflection, one can distinguish three
main groups of words: (1) verbs (inflection by conjugation), (2) nouns and adjectives
(inflection by declension) and (3) all other words. We should remark here that this
shape of the distribution is suitable for Latin, but for a different language, with a
different grammatical structure, it would have to be different - in particular, the
number of normal distributions in the sum would likely have to change. Moreover,
we used the normal distribution for the sake of simplicity, and we do not claim that
this reflects the actual distribution of inflected forms very accurately, but it is close
enough for our purposes. One should also note that $f_h$ may theoretically produce
negative numbers (again, with very small probability), and this is why we take the
absolute value of $X_i$. We also round $|X_i|$ down to the nearest integer. One could use in
place of the normal distribution some other distribution with a strongly pronounced
peak and producing only positive numbers, such as, for example, the log-normal
distribution. We found, however, that the detailed shape of the distribution is not
too crucial for our goal of reproducing the desired properties of the inflection graph,
thus we kept the normal distribution for simplicity.

Once the assembly of stars is created, we add a number of bridges between the
stars. The most crucial feature of these bridges comes from the fact that typically
two headwords share not one, but many inflected forms with another headword
or headwords. This is because there exists a large number of pairs of closely
related Latin words, each having a separate entry in the dictionary. For example, the
words dico (say), dictum (utterance, remark) and dictus (speech) are all closely
related, thus they share many inflected forms. After experimenting with many
possible methods for generation of bridges, we came up with a simple algorithm, which
basically adds a fixed number of edges at a time.

Let $\lambda$ and $T$ be two positive integers, to be used as parameters in our algorithm.

Algorithm for generating bridges

(1) Randomly select two headword vertices $H_a$ and $H_b$, where by $H_a$ we denote
the vertex with the larger degree. Vertex $H_a$ is already connected to $k$ inflected
forms; let us denote them by $\{I_r, I_{r+1}, \ldots, I_{r+k-1}\}$.

(2) Add $\lambda$ additional edges by connecting $H_b$ with vertices
$\{I_r, I_{r+1}, \ldots, I_{r+\lambda-1}\}$.

(3) Repeat the above two steps $T$ times.

Note that the second step is performed exactly as described even if $\lambda > k$, but in
this case some of the inflected forms with which we connect $H_b$ will not be inflected
forms of $H_a$, but inflected forms of some other word(s). Also note that $k$ is always
greater than zero, because the algorithm for generating stars ensures that this is
the case. This agrees with our interpretation of the meaning of the "inflected form".
We assume that every word has at least one inflected form - if it is an adverb,
for example, its sole "inflected form" is identical to itself. This is consistent with the
treatment of other parts of speech. For instance, for nouns we count nominative
singular among inflected forms, even though it is identical to the headword form.
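
The bridge-generating stage might be sketched in Python as below. Several details
are assumptions made for this illustration rather than statements about the
original program: headword degrees are compared using the star sizes before any
bridges are added, form labels are 0-based and wrap modulo $m$ if $r + \lambda - 1$
runs past the last label, and an edge that already exists is silently skipped, so
a step may add fewer than $\lambda$ new edges. The function name is hypothetical.

```python
import random


def add_bridges(stars, m, lam, T, rng):
    """Return a list of sets of form labels after adding T bridges."""
    graph = [set(arms) for arms in stars]
    star_size = [len(arms) for arms in stars]   # degrees before bridging
    first_form = [min(arms) for arms in stars]  # r: start of each block
    n = len(graph)
    for _ in range(T):
        a, b = rng.sample(range(n), 2)
        if star_size[a] < star_size[b]:
            a, b = b, a  # H_a is the headword with the larger degree
        r = first_form[a]
        for t in range(lam):
            # Connect H_b to I_r, ..., I_{r+lam-1} (labels wrap modulo m).
            graph[b].add((r + t) % m)
    return graph


# Two toy stars with consecutive blocks of labels: {0, 1, 2} and {3}.
toy_stars = [[0, 1, 2], [3]]
bridged = add_bridges(toy_stars, m=4, lam=2, T=1, rng=random.Random(0))
```

In the toy example the smaller star is connected to the first two forms of the
larger one, so the two headwords end up sharing two inflected forms.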

Regarding the values of $\lambda$ and $T$, they must be selected as follows. After
completing the algorithm for generating stars, the number of edges in the graph is only
slightly larger than $|I|$ (recall that in step 2 we are replacing indices exceeding $m$
with their values modulo $m$, but this happens only rarely, for a few values of $i$ close
to $n$). Of course, it could theoretically happen that the number of edges obtained
will be larger than the desired number of edges (we want to have the same number of edges
in the model graph as in the inflection graph being modeled). With the choice of
parameters which we have made, the probability of such an event is so exceedingly small
that for all practical purposes we can simply ignore this eventuality. Nevertheless,
if it indeed happened, one would have to discard the result and run the algorithm
for generating stars again.

Having fewer than the desired number of edges, we must ensure that the product
$\lambda T$ is equal to the number of remaining edges which we want to produce. This means
that only one of those two parameters can be freely chosen. By experimenting with
different values of $\lambda$ in the range from 1 to 15, we found that $\lambda = 10$ produces
the most clearly pronounced power-law distribution of headword group sizes in the
resulting graph. The typical corresponding value of $T$ in this case is $T = 7692$. We say
"typical" because, as explained earlier, the exact number of edges in the graph
obtained after applying the algorithm for generating stars will slightly fluctuate
for different realizations of the graph, thus the number of "missing edges", and
consequently the value of $T$, will slightly fluctuate too. The shape of the headword
group size distribution graph, however, is only weakly affected by changes of $\lambda$ and
$T$ as long as their product remains equal to the number of "missing edges" and
provided that $\lambda > 1$. For example, if instead of $\lambda = 10$ and $T = 7692$ we use
$\lambda = 5$ and $T = 15384$, there is almost no perceptible difference in the shape of the
graph.
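
The quoted values of $T$ can be checked with a short calculation. The Python
sketch below assumes, as an approximation, that the star-generating stage yields
exactly $|I|$ edges (the text says only "slightly larger"), and uses the edge
count of $G_{LA}$ and the value of $|I|$ quoted elsewhere in this paper; the
variable names are hypothetical.

```python
EDGES_TARGET = 1_077_806  # number of edges of G_LA
FORMS = 1_000_880         # |I|, assumed here to equal the edges after stage 1

missing = EDGES_TARGET - FORMS  # edges still to be added by bridges
lam = 10
T = missing // lam              # number of bridge steps for lambda = 10

# Keeping the product lambda * T fixed while halving lambda doubles T.
product = lam * T
T_half = product // 5           # number of bridge steps for lambda = 5
```

Under this assumption the calculation gives $T = 7692$ for $\lambda = 10$ and
$T = 15384$ for $\lambda = 5$, in agreement with the values quoted above.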

We generated a random graph following the above algorithm using $|H| = 28092$
and $|I| = 1000880$, that is, the same numbers of vertices as in the actual inflection
graph. This graph will be called $G_{mod}$. Its distribution of headword group sizes
is shown in Figure 4. Agreement with the actual distribution shown in Figure 1 is
indeed very good. Even the slope of the fitted line agrees (within the error bound)
with the slope observed for $G_{LA}$, as these are respectively $-3.4 \pm 0.5$ and
$-3.1 \pm 0.3$.

The model also performs well when one considers the $H$-projection of $G_{mod}$.
Figure 5 shows both the degree distribution and the graph of clustering coefficients
of cores of $G'_{mod}$. Comparing these graphs with Figure 3, we can observe good
qualitative agreement. The degree distribution of $G'_{mod}$ is very similar to the degree
distribution of $G'_{LA}$, except that $G'_{mod}$ misses a small number of high-degree
vertices present in $G'_{LA}$. Clustering coefficients of cores of both graphs exhibit
very similar behavior, that is, the clustering sharply increases with increasing core
number, and reaches a value close to 1 for the inner core, indicating the presence of
cliques in high (innermost) cores.










[Figure 4: log-log plot; horizontal axis: headword group size.]

Fig. 4. Distribution of headword group sizes for the model graph $G_{mod}$ averaged over 10
realizations of the graph. Slope of the fitted line is $-3.4 \pm 0.5$.


[Figure 5: two panels; horizontal axis of the left panel: degree.]

Fig. 5. Degree distribution (left) and clustering coefficients of cores (right) of the $H$-projection of
the model graph, averaged over 10 realizations of the graph. Error bars correspond to standard
deviation.


5. Conclusions 

We have discussed selected topological properties of inflection graphs and proposed a
random graph model which exhibits the desired properties. In particular, our model
possesses a nearly identical distribution of headword group sizes, and its $H$-projection
exhibits degree distribution and clustering coefficients of cores qualitatively similar
to analogous properties of the original inflection graph for the Latin and Polish
languages.

A number of unresolved questions remain. First of all, it would be helpful to
formally prove that the distribution of headword group sizes in our model follows
a power law, as well as to prove that the degree distribution of the $H$-projection
decreases exponentially with degree. We feel that further simplification of the model
may be needed in order to achieve this goal.

A separate question is the meaning and implications of the observed features
of inflection graphs in the linguistic context. It seems plausible, for example, that
the structure of the inflection graphs is in some sense optimal. If the number of
"bridges", that is, connections between headword stars, were much higher, the whole
inflection graph would be connected, and the disambiguation of headwords based
on inflected forms would be difficult. On the other hand, if there were no bridges
between headword stars at all, then a much larger number of inflected forms would
be needed. One can therefore speculate that the actual inflection graph represents
some sort of compromise between these two extremes. In order to substantiate this
claim one would need to construct a dynamical process producing many possible
forms of inflection graphs, and then show that the attractor of this process is the
actual inflection graph, just like in the case of self-organized criticality.

It is also possible to draw some further analogy between the percolation process
and inflection graphs. One can think of percolation as a process in which one starts
with a graph with $n$ vertices and no edges, and then adds random edges one by one.
The graph will then undergo a percolation transition, and the power-law distribution
of component sizes will be observed at the transition point. Below and above the
percolation point, no power law will be observed. In order to mimic this process,
we took the graph $G_{LA}$ and started adding random edges to it. As expected,
this destroyed the power-law distribution of component sizes of $G'_{LA}$, although,
obviously, it is very difficult to pinpoint how many edges exactly are needed to
destroy the power law - the power law is not exact in the first place. The same
phenomenon can be observed when one adds random edges to $G_{mod}$. One can
thus say that inflection graphs as well as the model graph are somewhat "frozen"
at the threshold, or slightly below the threshold, of some percolation process. As
intriguing as it is, this statement has to be taken very cautiously, because in the
actual inflection graph edges cannot be added or removed - the graph is a fixed
feature of the language. We plan to probe this issue further in the near future.